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Abstract — Signal modeling lies at the core of numerous signal 
and image processing applications. A recent approach that has 
drawn considerable attention is sparse representation modeling, 
in which the signal is assumed to be generated as a combination 
of a few atoms from a given dictionary. In this work we consider 
a Bayesian setting and go beyond the classic assumption of 
independence between the atoms. The main goal of this paper 
is to introduce a statistical model that takes such dependencies 
into account and show how this model can be used for sparse 
signal recovery. We follow the suggestion of two recent works 
and assume that the sparsity pattern is modeled by a Boltzmann 
machine, a commonly used graphical model. For general depen- 
dency models, exact MAP and MMSE estimation of the sparse 
representation becomes computationally complex. To simplify the 
computations, we propose greedy approximations of the MAP 
and MMSE estimators. We then consider a special case in which 
exact MAP is feasible, by assuming that the dictionary is unitary 
and the dependency model corresponds to a certain sparse graph. 
Exploiting this structure, we develop an efficient message passing 
algorithm that recovers the underlying signal. When the model 
parameters defining the underlying graph are unknown, we 
suggest an algorithm that learns these parameters directly from 
the data, leading to an iterative scheme for adaptive sparse signal 
recovery. The effectiveness of our approach is demonstrated on 
real-life signals - patches of natural images - where we compare 
the denoising performance to that of previous recovery methods 
that do not exploit the statistical dependencies. 

Index Terms — Sparse representations, signal synthesis, 
Bayesian estimation, MAP, MRF, Boltzmann machine, greedy 
pursuit, unitary dictionary, decomposable model, message 
passing, pseudo-likelihood, SESOP, image patches, denoising. 



I. Introduction 

Signal modeling based on sparse representations is used 
in numerous signal and image processing applications, such 
as denoising, restoration, source separation, compression and 
sampling (for a comprehensive review see HI). In this model 
a signal y is assumed to be generated as y — Ax + e, where 
A is the dictionary (each of the columns in A is typically 
referred to as an atom), a; is a sparse representation over this 
dictionary, and e is additive white Gaussian noise. Throughout 
this work we shall assume that the dictionary is known and 
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fixed, and our derivations consider both arbitrary and unitary 
dictionaries. Our goal is to recover the sparse representation x 
from y, and by multiplying the outcome by A we can achieve 
denoising. We will use the term "sparse signal recovery" (or 
just "sparse recovery" and "signal recovery") to describe the 
task of recovering x from y for both the case of arbitrary 
dictionaries and unitary ones. 

Various works that are based on this model differ in 
their modeling of the sparse representation x. The classi- 
cal approach to sparse recovery considers a deterministic 
sparse representation and signal recovery is formulated as a 
deterministic optimization problem. Some examples include 
greedy pursuit algorithms like orthogonal matching pursuit 
(OMP) and CoSaMP, and convex relaxations like basis pursuit 
denoising and the Dantzig selector (for comprehensive reviews 
see Q] , |0). Recent works |0, 0, 0, 0, Q, 10 suggested 
imposing additional assumptions on the support of x (the 
sparsity pattern), which is still regarded deterministic. These 
works show that using structured sparsity models that go 
beyond simple sparsity can boost the performance of standard 
sparse recovery algorithms in many cases. 

Two typical examples for such models are wavelet trees 
101 and block-sparsity 0, @. The first accounts for the fact 
that the large wavelet coefficients of piecewise smooth signals 
and images tend to lie on a rooted, connected tree structure 
|0. The second model is based on the assumption that the 
signal exhibits special structure in the form of the nonzero 
coefficients occurring in clusters. This is a special case of a 
more general model, where the signal is assumed to lie in 
a union of subspaces \A\, [5|. Block-sparsity arises naturally 
in many setups, such as recovery of multi-band signals |10|, 
iTTTl and the multiple measurement vector problem. However, 
there are many other setups in which sparse elements do not 
fit such simple models. In [7| the authors propose a general 
framework for structured sparse recovery and demonstrate 
how both block-sparsity and wavelet trees can be merged into 
standard sparse recovery algorithms. 

In many applications it can be difficult to provide one 
deterministic model that describes all signals of interest. For 
example, in the special case of wavelet trees it is well known 
that statistical models, such as hidden Markov trees (HMTs) 
[12 1, are more reliable than deterministic ones. Guided by this 
observation, it is natural to consider more general Bayesian 
modeling, in which the sparse representation is assumed to be 
a random vector. Many sparsity-favoring priors for the repre- 
sentation coefficients have been suggested in statistics, such 
as the Laplace prior, "spike-and-slab" (mixture of narrow and 
wide Gaussian distributions) and Student's t distribution (for a 
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comprehensive review see |[T3l ). However, the representation 
coefficients are typically assumed to be independent of each 
other. 

Here we are interested in Bayesian modeling that takes into 
account not only the values of the representation coefficients, 
but also their sparsity pattern. In this framework sparsity is 
achieved by placing a prior distribution on the support, and the 
representation coefficients are modeled through a conditional 
distribution given the support. The most simple prior for the 
support assumes that the entries of the sparsity pattern are 
independent and identically distributed (i.i.d.) (see e.g. fl4l ). 
However, in practice, atoms in the dictionary are often not 
used with the same frequency. To account for this behavior, 
we can relax the assumption that the entries are identically 
distributed and assign different probabilities to be turned "on" 
for each entry lfl5l . 

Besides the modeling aspect, another key ingredient in 
Bayesian formulations is the design objective. Two popular 
techniques are maximum a posteriori (MAP) and minimum 
mean square error (MMSE) estimators. Typically these esti- 
mators are computationally complex, so that they can only be 
approximated. For example, approximate MAP estimation can 
be performed using a wide range of inference methods, such 
as the relevance vector machine lfl6l and Markov chain Monte 
Carlo (MCMC) El- Such estimators are derived in lfl3l, lfl8l 
based on sparsity-favoring priors on x and approximate infer- 
ence methods. In fl4l . |fl9ll approximate MMSE estimators are 
developed, based on an i.i.d prior on the support. Finally, in 
the special case of a square and unitary dictionary, assuming 
independent entries in the support and Gaussian coefficients, 
it is well known that the exact MAP and MMSE estimators 
can be easily computed fl5ll . l20l . 

Independence between the entries in the support can be a 
useful assumption, as it keeps the computational complex- 
ity low and the performance analysis simple. Nevertheless, 
this assumption can be quite restrictive and leads to loss 
of representation power. Real-life signals exhibit significant 
connections between the atoms in the dictionary used for their 
synthesis. For example, it is well known that when image 
patches are represented using the discrete cosine transform 
(DCT) or a wavelet transform, the locations of the large 
coefficients are strongly correlated. Recent works ll2"Tl . Il22ll . 
|23|, 1241 . Il25ll have made attempts to go beyond the classic 
assumption of independence and suggested statistical models 
that take dependencies into account. The special case of 
wavelet trees is addressed in ETl . where HMTs are merged 
into standard sparse recovery algorithms, in order to improve 
some of their stages and lead to more reliable recovery. An- 
other statistical model designed to capture the tree structure for 
wavelet coefficients, was suggested in 11221 . An approximate 
MAP estimator was developed there based on this model and 
MCMC inference. 

Here we consider more general dependency models based 
on undirected graphs, which are also referred as Markov 
random fields (MRFs), and focus on the special case of a 
Boltzmann Machine (BM). To the best of our knowledge a 
BM structure for sparsity patterns was originally suggested 
in lf23l in the context of Gabor coefficients. MCMC inference 



was used there for non-parametric Bayesian estimation. In l24ll 
the authors also use a BM structure, which allows them to 
introduce the concept of interactions in a general sparse coding 
model. An approximate MAP estimator is then developed 
by means of Gibbs sampling and simulated annealing ifTTIl . 
Finally, in l25l a BM prior on the support is used in order 
to improve the CoSaMP algorithm. We will relate in more 
detail to the recent works which used BM-based modeling and 
emphasize differences between these works and our approach 
in Section IXl 

The current paper is aimed at further exploring the BM- 
based model proposed in 11231 . (24), ll25l . Once we adopt the 
BM as a model for the support, several questions naturally 
arise: how to perform pursuit for finding the sparse represen- 
tation, how to find the model parameters, and finally how to 
combine these tasks with dictionary learning. In this paper we 
address the first two questions. For pursuit we suggest using 
a greedy approach, which approximates the MAP and MMSE 
estimators and is suitable for any set of model parameters. We 
then make additional modeling assumptions, namely a unitary 
dictionary and a banded interaction matrix, and develop an 
efficient message passing algorithm for signal recovery which 
obtains the exact MAP estimate in this setup. For learning 
the Boltzmann parameters we suggest using a maximum 
pseudo-likelihood (MPL) approach and develop an efficient 
optimization algorithm for solving the MPL problem. Finally, 
we use a block-coordinate optimization approach to estimate 
both the sparse representations and the model parameters 
directly from the data. This results in an iterative scheme for 
adaptive sparse signal recovery. 

The paper is organized as follows. In Section[II]we motivate 
the need for inserting probabilistic dependencies between 
elements in the support by considering sparse representations 
of image patches over a DCT dictionary. In Section [Til] we 
introduce useful notions and tools from the graphical models 
field and explore the BM prior. Section [IV] defines the signal 
model, along with the MAP and MMSE estimation problems. 
In Section [V] we develop several greedy approximations of the 
MAP and MMSE estimators for the BM prior. We then present 
setups where the MAP problem can be solved exactly and 
develop an efficient algorithm for obtaining the exact solution 
in Section[VT] We explore the performance of these algorithms 
through synthetic experiments in Section I VIII Estimation of 
the model parameters and adaptive sparse signal recovery are 
addressed in Section [Villi The effectiveness of our approach 
is demonstrated on image patches in Section [IX] Finally, we 
discuss relations to past works in Section [X] 

II. Motivation 

In this section we provide motivation for inserting prob- 
abilistic dependencies between elements in the support. We 
consider a set of N = 100, 000 patches of size 8-by-8 that 
are extracted out of several noise-free natural images. For 
each patch, we perform a preliminary stage of DC removal 
by subtracting the average value of the patch, and then obtain 
sparse representations of these patches over an overcomplete 
DCT dictionary of size 64-by-256 (n-by-m) using the OMP 
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algorithm. We consider a model error of a — 2, so that OMP 
stops when the residual error falls below e = ^/na = 16. 
We then compute the empirical marginal distributions for 
each of the dictionary atoms and for all pairs of atoms, 
namely we approximate Pr(5, = 1), i — 1, ...,m and 
Pr(Si = l,Sj = 1), i = l,...,m — 1, j > i, where S 
is a binary vector of size m and Si = 1 denotes that the 
ith atom is being used. The empirical conditional probability 
Pr(Si — l\Sj — 1) can then be computed as the ratio between 
Pr(S z = 1, Sj = 1) and Pr(S,- = 1). 

We address several assumptions that are commonly used in 
the sparse recovery field and suggest validity tests for each 
of them. The first assumption is that the elements in the 
support vector are identically distributed, namely Pr(Sj = 
1) — p for all i, where < p < 1 is some constant. 
This assumption can be examined by comparing the marginal 
probabilities Pr(5^ = 1) for each atom. The second as- 
sumption is independency between elements in the support. 
The independency assumption between atoms i and j implies 
that Pr(5 4 = l\Sj = 1) = Pr(S' J = 1). Therefore, we 
can test for independency by comparing the marginal and 
conditional probabilities for each pair of atoms. Next we turn 
to the block-sparsity assumption. Assuming that i and j are 
in the same cluster implies that the conditional probabilities 
Pr(5' 4 = l\Sj = 1) and Pr(£y = l\Si = 1) are near 1. 

To examine the validity of each of the above-mentioned 
assumptions, we compute the variables 
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Pr(S, = 1) 



1 < i < m 



logio 



Pv(S l = lj^=_l) 
Pr(5, = 1) 



1 < i,j < m 



(1) 



Vij = |log 10 (Pr(£ = l\S j = l) + 5)\ ) 1 < i,j < m 



where p denotes the average probability of an entry to be 
turned "on" , namely p = — Y%Li P r ($J = 1), R is a vector 
of size m and U, V are matrices of size m-by-m. We use 
S = 0.1, so that for Pr(5i = l\Sj = 1) = we get a 
value 1 in Uij and Vij (i and j denote the row and column 
indices respectively). In each of the functions in ([]]) a near- 
zero result implies that the corresponding assumption is valid; 
as we go further away from zero the validity of the assumption 
decreases. The logarithms are used to improve the visibility 
of the results. 

The results are shown in Fig.Q] On the left we plot the val- 
ues in R. This plot demonstrates that the individual frequencies 



can be very far from the average one. Consequently, the DCT 
atoms are used with varying frequencies. The matrix U is 
displayed in the middle. The black color, which corresponds 
to near-zero values, is dominant. This illustrates that the inde- 
pendency assumption is satisfactory for many pairs of DCT 
atoms. However, some pairs exhibit significant interactions 
(see the white diagonals near the main diagonal and the bright 
spots). The image on the right displays the matrix V, which 
is dominated by the white color, corresponding to near-one 
values. High values in the entries Vij or Vj t i indicate that 
it is not reasonable to assume that the corresponding atoms 
belong to the same cluster in a block-sparse model (regardless 
of the block sizes). Since this is the case for most pairs of 
DCT atoms, we conclude the block-sparsity approach does 
not capture the dependencies well in this example. 

It is interesting to note that while the OMP algorithm 
reveals different frequencies of appearance for the atoms and 
significant correlations between pairs of atoms, it in fact makes 
no use of these properties. Therefore, it seems plausible that 
a stochastic model that will capture the different nature of 
each atom, as well as the important interactions between the 
atoms, can lead to improved performance. In this paper we 
will show how this can be accomplished in a flexible and 
adaptive manner. In Section [IX] we will return to this very set 
of patches and show that the proposed model and methods do 
better service to this data. 



III. Background on Graphical Models 

The main goal of this paper is using graphical models 
for representing statistical dependencies between elements in 
the sparsity pattern and developing efficient sparse recov- 
ery algorithms based on this modeling. In order to set the 
ground for the signal model and the recovery algorithms, we 
provide some necessary notions and methods from the vast 
literature on graphical models. We begin by presenting MRFs 
and explain how they can be used for describing statistical 
dependencies. We then focus on the BM, a widely used 
MRF, explore its properties and explain how it can serve 
as a useful and powerful prior on the sparsity pattern. For 
computational purposes we may want to relax the dependency 
model. One possible relaxation, which often reduces compu- 
tational complexity and still bears considerable representation 
power, is decomposable models. Finally, we present a powerful 
method for probabilistic inference in decomposable models, 
coined belief propagation. Decomposability will be a modeling 
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assumption in Section [Vj] and the algorithm we propose in 
Section IVI-Bl will be based on belief propagation techniques. 

A. Representing Statistical Dependencies by MRFs 

In this subsection we briefly review MRFs and how they 
can be used to represent statistical dependencies. This re- 
view is mainly based on ||26l . A graphical model is defined 
by its structural and parametric components. The structural 
component is the graph G = (V, e) where V is a set of 
nodes (vertices) and e is a set of undirected edges (links 
between the nodes). In a graphical model there is a one- 
to-one mapping between nodes {1,2, ...,m} and random 
variables {Si, S%, . . . , S m }. Let Sa,Sb,Sc stand for three 
disjoint subsets of nodes. We say that Sa is independent of 
Sc given Sb if Sb separates Sa from Sc, namely all paths 
between a node in Sa and a node in Sc pass via a node in 
Sb- Thus, simple graph separation is equivalent to conditional 
independence. The structure can be used to obtain all the 
global conditional independence relations of the probabilistic 
model. By "global" we mean that conditional independence 
holds for all variable assignments and does not depend on 
numerical specifications. For a visual demonstration see Fig. 
|2f a); using the above definition it easy to verify for example 
that Si is independent of S4, S5 given S2, S3. 

Turning to the parametric component, note that the joint 
probability distribution is represented by a local parametriza- 
tion. More specifically, we use a product of local nonnegative 
compatibility functions, which are referred to as potentials. 
The essence of locality becomes clearer if we define the 
notion of cliques. A clique is defined as a fully-connected 
subset of nodes in the graph. If Sj and Sj are linked, 
they appear together in a clique and thus we can achieve 
dependence between them by defining a potential function on 
that clique. The maximal cliques of a graph are the cliques 
that cannot be extended to include additional nodes without 
losing the property of being fully connected. Since all cliques 
are subsets of one or more maximal cliques, we can restrict 
ourselves to maximal cliques without loss of generality. For 
example, in Fig. |2ja) the maximal cliques are C\ = {1, 2, 3}, 
C2 = {2,3,4} and C3 = {3,4,5}. To each maximal clique 
C we assign a nonnegative potential ^c(Sc)- The joint 
probability is then given as a product of these potentials, up 
to a normalization factor Z: 

Pr(S) = !lI *c(Sc). (2) 

c 

If the potentials are taken from the exponential family, namely 

^c(Sc) = cxp {-E c (Sc)h then Pr(S) = \ exp{-^(S)}, 
where E(S) = J2c ^c(Sc) is the energy of the system. 

B. The Boltzmann Machine 

In this subsection we focus on the BM, a widely used MRF 
We are about to show that this can serve as a useful and 
powerful prior on the sparsity pattern. The BM distribution is 
given by: 

Pr(S) = - exp ( b T S + ls T Ws] , (3) 




(a) Graph (b) Interaction matrix 



Figure 2. A simple dependency model for 5 variables. This is a chordal 
graph with 3 missing edges. The interaction matrix in the corresponding BM 
is banded. 

where S is a binary vector of size m with values in {—1, l} m , 
W is symmetric and Z is a partition function of the Boltzmann 
parameters W, b that normalizes the distribution. We can 
further assume that the entries on the main diagonal of W are 
zero, since they contribute a constant to the function S T WS. 
In this work the BM will be used as a prior on the support 
of a sparse representation: Si = 1 implies that the ith atom is 
used for the representation, whereas for Sj = — 1 this atom is 
not used. 

The BM is a special case of the exponential family with 
an energy function E(S) = ~b T S - \S T WS. The BM 
distribution can be easily represented by an MRF - a bias 
bi is associated with a node i and a nonzero entry Wij in 
the interaction matrix results in an edge connecting nodes i 
and j with the specified weight. Consequently, the zero entries 
in W have the simple interpretation of missing edges in the 
corresponding undirected graph. This means that the sparsity 
pattern of W is directly linked to the sparsity of the graph 
structure. From graph separation we get that if Wij = 
then Si and Sj are statistically independent given all their 
neighbors {Si} ie N(i)uN(j), l&j- ^ or example, if the matrix 
W corresponds to the undirected graph that appears in Fig. 
I2a) then W14 = W15 — W25 = 0. This matrix is shown in 
Fig. 0b). 

The maximal cliques in the BM are denoted by 
Ci, . . . , Cp and we would like to assign potential functions 
{^d {Sci)}f = i t0 these cliques that will satisfy the require- 
ment exp \b T S + \S T WS) = n^Li (S C J- One possible 
choice is to assign each of the terms in E(S) using a pre- 
specified order of the cliques: biSi is assigned to the clique 
that consists of Si and appears last in the order and a non-zero 
term WijSiSj is assigned to the clique that consists of Si, Sj 
and appears last in the order. 

Next, we turn to explore the intuitive meaning of the Boltz- 
mann parameters. In the simple case of W = 0, the BM dis- 
tribution becomes Pr(S) = 4 II"=i ex P 0>iSi). Consequently, 
{Si}iLi Slle statistically independent. Using straightforward 
computations we get Pr(Si = —1) = exp(— 26,) Pr(Si = 1) 
for i = 1, . . . , m. Since Pr(S l = -1) + Pr(S; = 1) = 1, Si 
has the following marginal probability to be turned "on": 

Vl 4 Pr(S 4 = 1) = — 1 1 < i < m. (4) 

1 + exp(-2f>i) 

When W is nonzero, (0]) no longer holds. However, the simple 
intuition that Sj tends to be turned "off" as 6, becomes more 
negative, remains true. 

We would now like to understand how to describe corre- 
lations between elements in S. To this end we focus on the 
simple case of a matrix W of size 2-by-2, consisting of one 
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parameter Wn, and provide an exact analysis for this setup. 
In order to simplify notations, from now on we use p^j(u\v) 
to denote Pr(Si = u\Sj = v). Using these notations we can 
write down the following relation for the simple case of a pair 
of nodes: 



where 



Pi =Pl|2(l|l)P2 +Pl| 2 (l| - l)(l-p 2 ), 



(5) 




9 Cl {S Cl ) m 21 *c 2 (S C2 ) m a *c 3 (Sc 3 ) 



Figure 3. A clique tree which is constructed for the graph that appears in 
Fig. [2] In this case the clique tree takes the form of a simple chain of size 3. 
Potential functions are defined for each of the cliques and exact probabilistic 
inference is performed by message passing. 



Pl l 2(1|1)= l + exp(-26 1 -2W 12 ) 
Pl|2(1|_1)= l + exp(-26 1 + 2W 12 ) 



(6) 



From © we see that p\ is a convex combination ofp 1 | 2 (l| — 1) 
and p 1 | 2 (l|l). Hence, for W12 > we have j>x| 2 (1 1 — 1) < 
Pi < Pi|2(l|l) an d f° r W12 < we have p 1 | 2 (l|l) < pi < 
Pi| 2 (l|-1). 

For a general matrix W these relations are no longer strictly 
accurate. However, they serve as useful rules of thumb: for 
an "excitatory" interaction (Wy > 0) Si and Sj tend to be 
turned "on" ("off") together, and for an "inhibitory" interaction 
(Wij < 0) Si and Sj tend to be in opposite states. The intuition 
into the Boltzmann parameters provides some guidelines as to 
how the BM prior can be used for sparse representations. If 
the values of the biases in the vector b are negative "enough" 
and there are few strong excitatory interactions, then the mean 
cardinality of the support tends to be small. This reveals some 
of the power of the BM as a prior on the support in the signal 
model. It can achieve sparsity and at the same time capture 
statistical dependencies and independencies in the sparsity 
pattern. 

To conclude this section, note that standard sparsity-favoring 
models can be obtained as special cases of the BM model. For 
W — and hi — i In ( fzjj f° r a ^ which correspond to 
an i.i.d. prior, the cardinality k has a Binomial distribution, 
namely k ~ Bin(p, m). For a low value of p the cardinalities 
are typically much smaller than m, so that plain sparsity is 
achieved. BM can also describe a block-sparsity structure: 
Assuming that the first k\ entries in S correspond to the first 
block, the next k 2 to the second block, etc., the interaction 
matrix W should be block-diagonal with "large" and positive 
entries within each block. The entries in b should be chosen 
as mentioned above to encourage sparsity. 

C. Decomposable Graphical Models 

We now consider decomposability in graphical models |26l . 
|27|. A triplet {A, B, C} of disjoint subsets of nodes is a 
decomposition of a graph if its union covers all the set V, B 
separates A from C and B is fully-connected. It follows that 
a graphical model is regarded as decomposable if it can be 
recursively decomposed into its maximal cliques, where the 
separators are the intersections between the cliques. It is well 
known that a decomposable graph is necessarily chordal |28|. 
This means that each of its cycles of four or more nodes has a 
chord, which is an edge joining two nodes that are not adjacent 
in the cycle. Consequently, for a given MRF we can apply a 
simple graphical test to verify that it is decomposable. 



In Section |VT] we consider decomposable BMs. This as- 
sumption implies that the matrix W corresponds to a chordal 
graph. We now provide some important examples for decom- 
posable graphical models and their corresponding interaction 
matrices. Note that a graph which contains no cycles of length 
four is obviously chordal as it satisfies the required property 
in a trivial sense. It follows that a graph with no edges, a 
graph consisting of non-overlapping cliques and a tree are all 
chordal. The first example is the most trivial chordal graph and 
corresponds to W = 0. The second corresponds to a block- 
diagonal matrix and as we explained in Section IIII-BI it can 
describe a block-sparsity structure. Tree structures are widely 
used in applications that are based on a multiscale framework. 
A visual demonstration of the corresponding matrix is shown 
in E3. 

Another common decomposable model corresponds to a 
banded interaction matrix. In an Lth order banded matrix only 
the 2L + 1 principal diagonals consist of nonzero elements. 
Assuming that the main diagonal is set to zero, we have that 
there can be at most (2m— (L + 1))L nonzero entries in an Lth 
order banded W, instead of m? — m nonzeros as in a general 
interaction matrix. Consequently, the sparsity ratio of W is of 
order L /m. This matrix corresponds to a chordal graph with 
cliques C,: = {Si, . . . , Si+l} , i = 1, . . . , m—L. For example, 
the matrix in Fig. |2b) is a second order banded matrix of size 
5-by-5. This matrix corresponds to a chordal graph (see Fig. 
|2ja)) with three cliques. 

Chordal graphs serve as a natural extension to trees. It 
is well known [26 1 that the cliques of a chordal graph can 
be arranged in a clique tree, which is called a junction tree. 
In a junction tree T each clique serves as a vertex and any 
two cliques containing a node v are either adjacent in T or 
connected by a path made entirely of cliques containing v. 
For a visual demonstration see Fig. [3] where a clique tree is 
constructed for the chordal graph of Fig. Ufa). In this case 
where the interaction matrix is banded, the clique tree is 
simply a chain. It can easily be verified that this is in fact 
true for a banded interaction matrix of any order. 

We now turn to describe belief propagation, a powerful 
method for probabilistic inference tasks like computation of 
single node marginal distributions and finding the most prob- 
able configuration. Exact probabilistic inference can become 
computationally infeasible for general dependency models as 
it requires a summation or maximization over all possible 
configurations of the variables. For example, in a general 
graphical model with m binary variables the complexity 
of exact inference grows exponentially with m. However, 
when the graph structure is sparse, one can often exploit the 
sparsity in order to reduce this complexity. The inference 
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tasks mentioned above can often be performed efficiently 
using belief propagation techniques [26|. More specifically, 
in a decomposable MRF exact inference takes the form of 
a message passing algorithm, where intermediate factors are 
sent as messages along the edges of the junction tree (see for 
example the messages passed along the chain in Fig. 0. For 
more details on message passing see l|26l . 

The complexity of exact inference via message passing 
strongly depends on the tree-width of the graph. In a de- 
composable model this is defined as the size of the largest 
maximal clique minus one. For example, in the special case 
of a BM with an ith order banded W we have that the tree- 
width is L. We can conclude that for a decomposable model 
there is an obvious tradeoff between computational complexity 
and representation power. For example, in the special case of 
an Lth order interaction matrix the computational complexity 
of exact inference decreases with L, but at the same time 
the graphical model captures fewer interactions. Nevertheless, 
decomposable models can serve as a useful relaxation for a 
general dependency model, as they can achieve a substantial 
decrease in the complexity of exact inference, while still 
capturing the significant interactions. 

IV. BM Generative Model 

In this section we use the BM for constructing a stochastic 
generative signal model. We consider a signal y which is 
modeled as y = Ax + e, where A is the dictionary of size 
ri-by-m, x is a sparse representation over this dictionary and 
e is additive white Gaussian noise with variance g\. We denote 
the sparsity pattern by S G { — 1, l} m , where Si = 1 implies 
that the index i belongs to the support of x and Si = — 1 
implies that Xi = 0. The nonzero coefficients of x are denoted 
by x s , where s is the support of x. Following [24 1 we consider 
a BM prior for S and a Gaussian distribution with zero mean 
and variance a\ i for each nonzero representation coefficient 
Xi. Note that the variances of the non-zero representation 
coefficients are atom-dependent. It follows that the conditional 
distribution of x s given the support s is 



Pr(z s |s) 



1 



■ exp 



1 Ty^-l, 

_ X r. 2—1 - X i 

2 S S . 



(7) 



det(27rS s ) 1/2 

where E s is a k x k diagonal matrix with diagonal elements 
(E s )i,i — &x s an d & i s ^e cardinality of the support s. 
Using the assumption on the noise we can also write down 
the conditional distribution for the signal y given its sparse 
representation: 



Pr(y|a 



1 



(27R7*) 



i./2 



exp 



1 

2^ 



IV- 



A s x s || 2 



(8) 



The proposed generative model combines a discrete dis- 
tribution for S and continuous distributions for x given S 
and y given x, so that computations of posterior distributions 
should be handled carefully. Notice that an empty support s 
necessarily implies x — 0, so that Pr(x = 0) is a discrete 
distribution (it equals Pr(5 = — l mx1 )). However, for nonzero 
vectors v we have that Pr(x = v) is a continuous distribution. 
Using Bayes' law we can deduce that just like Pr(x), the 



posterior Pi(x\y) is a mixture of a discrete distribution for 
x = and a continuous distribution for all nonzero values of 
x. Our goal is to recover x given y. However, from the above 
discussion we have that given y, the representation vector 
x equals zero with a nonzero probability, whereas for any 
nonzero vector v the event x — v occurs with probability 
zero. It follows that the MAP estimator for x given y leads to 
the trivial solution x — 0, rendering it useless. 

The distribution Pr(s|y) however is a discrete one. There- 
fore, we suggest to first perform MAP estimation of s given y 
and then proceed with MAP estimation of x given y and the 
estimated support s [15]. This suggestion aligns with previous 
approaches in the sparse recovery field. In fact, standard 
algorithms for sparse recovery, such as OMP, take a similar 
approach - they first obtain an estimate for the support which 
minimizes the residual error and then rely on this estimate 
for signal reconstruction. Indeed, even the celebrated ^i-norm 
minimization approach is often used as a means to find the 
support, followed by a least-squares step for finding the final 
representation values (this is known as debiasing). 

We begin by developing an expression for Pr(y|s) by 
integrating over all possible values of x s el 1 : 



Pr(y|s) = / Pr(y\x 3 ,s)Pr(x s 

Jx 3 £R k 



\s)dx s 



C- 



det [ -KAJA B Y, S + I 



where C = 1 /(2ira^)" /2 exp | — ^rlMl!} i s a constant and 
Q s = A^Ag + o-gEjT 1 . This leads to the following estimator 
for the support: 



s MAP =argmaxPr(s|y) = argmaxPr(?/|s) Pr(s) 



sen 



sen 



(10) 



=argmax— -y T A s Q s 1 A T s y- 
Un{det (Q S )) + ^S T WS+ (b-^v) S 



where Vi = In (^l.i/al) and S depends on s through Si = 
2-l[i € s] — 1 for all i, with 1 [•] denoting the indicator function. 
The feasible set f2 denotes all 2 m possible supports. In terms 
of S, this is the set of all vectors satisfying Sf = 1 for all 
i. Note that for an empty support the two first terms in ( ITOl ) 
vanish. Once we have an estimate s — smap of the support, 
we can compute a MAP estimator of x using the same formula 
as in the oracle estimator (see |fT31 ): 

irs MAP = argmaxPr(^|y,s) = Qj x A^y. (11) 

x s £R k 

We now turn to MMSE estimation of x given y. Here we 
have: 

xmmse = E[x\y] = Pr ( s \v) E l x \y> s ]> ( 12 ) 

sen 

where E\x\y, s] equals argmaxPr (x\y, s) ifTBI (this is not true 

X 

in general, but rather for the specific distribution considered 
here) and is computed using the oracle formula: E[x s \y, s] = 
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Q~ 1 A^y for the on support entries (and the off support entries 
are set to zero). 

In the sequel we first focus on the case where all model 
parameters - the Boltzmann parameters W, b, the variances 
{cr^j} , the dictionary A and the noise variances <j\ are 
known. For a general dictionary A and an arbitrary symmetric 
interaction matrix W the exact MAP and MMSE estimators 
require an exhaustive search or sum over all 2 m possible 
supports. To overcome the infeasibility of the combinatorial 
search or sum, we consider two approaches. In the first, 
developed in Section [V] we approximate the MAP and MMSE 
estimators using greedy methods. An alternative strategy is 
to make additional assumptions on the model parameters, 
namely on A and W, that will make exact estimation feasible. 
This approach is addressed in Section |VI] where we consider 
unitary dictionaries A and decomposable BMs. The more 
practical setup where the model parameters are also unknown 
is considered in Section IVIIII for which we derive efficient 
methods for estimating both the sparse representations and the 
model parameters from a set of signals. 

V. Greedy Pursuit for Approximate MAP and 

MMSE ESTIMATION 

Throughout this section we assume an arbitrary dictionary 
and an arbitrary symmetric interaction matrix and make use of 
the BM-based generative model to solve a fundamental sparse 
coding problem - finding the sparse representation of a signal 
from noisy observations. As we have seen in the previous 
section, exact MAP and MMSE estimation in this setup require 
an exhaustive search or sum over all 2 m possible supports. 
To simplify the computations, we propose using a greedy 
approach. In this section we suggest three greedy pursuit algo- 
rithms for our model-based sparse recovery problem. The two 
first algorithms are OMP-like and thresholding-like pursuits 
which approximate the MAP estimate of the support s given 
the signal y. The third pursuit method is a randomized version 
of the proposed OMP-like algorithm (similar to the rand-OMP 
method |fl9l ), which approximates the MMSE estimate of the 
representation vector x given the signal y. 

A. OMP-like MAP 

We begin with the OMP-like algorithm and explain its 
core idea. Our goal is to estimate the support which achieves 
the maximal value of the posterior probability Pr(5|y). This 
means that our objective function is the one that appears 
in ([Tol l. We start with an empty support, which means that 
{Si}7=i are a ^ — At the first iteration, we check each of the 
m possible elements that can be added to the empty support 
and evaluate the term in ( TTOb . The entry i* leading to the 
largest value is chosen and thus Si t is set to be +1. Given the 
updated support, we proceed exactly in the same manner. In 
every iteration we consider all the remaining inactive elements 
and choose the one that leads to the maximal value in (ITOb 
when added to the previously set support. The algorithm stops 
when the value of (TTOb is decreased for every additional item 
in the support. 



In each iteration only one entry in S changes - from —1 to 
1. This can be used to simplify some of the terms that appear 
in ([Toll: 

l -S T WS = lYl W ^ S ^ =C 1+ 2J2 WijSj 



]Tln (< 4 /«r2)5 i = C 3 + 21n« i ) 

i=l 

where C\ , C% , C3 are constants that will not be needed in our 
derivation. Consequently, in each iteration it is sufficient to 
find an index i (out of the remaining inactive indices) that 
maximizes the following expression: 



1 



1 



Val(i) =^ y T A sk QjA^y- ^ In (|det (Q a *)l) 
2WfS fe + 26,-1 In (o* ti ) 



(14) 



where st is the support estimated in iteration k — 1 with the 
entry i added to it, Q s k — A^ k A s k + CgE^ 1 and Wf is 
the ith row of W. A pseudo-code for the proposed OMP-like 
algorithm is given in Algorithm [JJ 

Algorithm 1 Greedy OMP-like algorithm for approximating 
the MAP estimator of ( fTOb 

Input: Noisy observations y E R" and model parameters 

W,b, {<Tz,i}™ x ,A, a e . 
Output: A recovery Jmap for the support. 

s o = 05 s« = -l mxl 

k = 1 
repeat 



for 1 & s 

„k _ „k-l 



1 do 

.< ^ U i 



5* = S^Ui„ S*[j] 



S k [j] = . , 

Evaluate Val(i) using (fT4|) . 
end for 

i* = argmax^ {Val(i)} 

v 1 . 3 = i* 

k = k + l 

until Pr (s£|t/) < Pr (sj _1 |y) 

Return: s M ap = s*~ l 

We now provide some intuition for the expressions in ( fl4] >. 
The term y T A s kQ~ k A^ k y is equivalent to the residual error 

||r fe || 2 , where r k — y — A s k (Aj fc A s fc) 1 A T &k y is the residual 
with respect to the signal. To see this, notice that the following 
relation holds: 

||r fc ||; = \\y\g - y T A s k (A^y 1 A T sk y. (15) 

Using the definition of Q s k it can be easily verified that the 
two terms take a similar form, up to a regularization factor 
in the pseudoinverse of A s k. Next, we turn to the terms 



s 



S and hi. The first corresponds to the sum of interactions 
between the zth atom and the rest of the atoms which arise 
from turning it on (the rest remain unchanged). The second 
term is the separate bias for the ith atom. As the sum of 
interactions and the separate bias become larger, using the ith 
atom for the representation leads to an increase in the objective 
function. Consequently, the total objective of ([T4l > takes into 
consideration both the residual error with respect to the signal 
and the prior on the support. This can lead to improved 
performance over standard pursuit algorithms like OMP and 
CoSaMP, which are aimed at minimizing the residual error 
alone. 

B. Thresholding-like MAP 

To simplify computations, we can consider a thresholding- 
like version of Algorithm Q] Again we start with an empty 
support and compute Val(i) using (|T?)) for i = 1, . . . , m, just 
as we do in the first iteration of Algorithm Q] We then sort the 
indices according to Val(i) in a descending order and consider 
to, candidate supports for solving the MAP estimation problem, 
where the kth candidate consists of the first k elements in 
the above order. Among these supports we choose the one 
that maximizes the posterior probability Pr(S\y). A pseudo- 
code for the proposed thresholding-like algorithm is given in 
Algorithm 

Algorithm 2 Greedy thresholding-like algorithm for approxi- 
mating the MAP estimator of ( fTUb 

Input: Noisy observations y G K™ and model parameters 

W,b, {<7x,i\7=l >A,<r e . 

Output: A recovery smap for the support, 
for i G {1, . . . , to} do 

s = i , 

-1 , j^* 

K 1 , 3 = i 
Evaluate Val(i) using ([T4]) . 
end for 

Sort Val(i) in a descending order and arrange the indices 
1, . . . , m according to this order, 
for k G {1, ... , to} do 

Set to include the first k elements in above order. 

Compute Pr (s^ \y). 
end for 

fc* = argmax fc {Pr (s( fe ) \y) } 
Return: 5 MAP = 

C. Random OMP-like MMSE 

Another alternative is using a randomized version of Al- 
gorithm [U which approximates the MMSE estimate. The 
algorithmic framework remains the same as before, except 
for two changes. First, instead of adding to the support the 
element that maximizes Val(i) in each iteration, we make 
a random choice with probabilities -J- exp{Val(i)} for all 
the candidates i, where Z\ is a constant that normalizes the 
probabilities. Second, we perform J runs of this algorithm 
and average the resulting sparse representations {x®}^-, that 
are computed using ( fTTT i to obtain the final estimate for x, A 



S[3] 



pseudo-code for the proposed randomized greedy algorithm is 
given in Algorithm [3] 

Algorithm 3 Randomized version of Algorithm Q] for approx- 
imating the MMSE estimator of (fT2l) 

Input: Noisy observations y G K™, model parameters 
W, b, {cr Xy i}™ =1 , A, a e and number of runs Jo. 

Output: A recovery £mmse f° r me representation vector, 
for I = 1 to J do 

x (i) = o™>< 1 

s£ = 0, s° = -i mxl 

k = 1 
repeat 

for i s^ 1 do 

s k = s*~ x U i 

v 1 i 3 = i 

Evaluate Val(i) using (fl4|. 
end for 

Choose i* at random with probabilities 
exp{Val(i)}. 



S k [3] 



1 



s h ; = s k - 1 uz„ s k [j] = 
k = k + i 

until Pr(s*|y) < Pr(s k - 1 \ y ) 



3= % * 



„fe-i 



Compute x. using (fTTT >. 
end for 

Jo 

Return: .t mmse = 4- E x(0 



D. Related Pursuit Methods 

To conclude this section, we mention some related works. 
First, note that for W = and equal biases 6j for all i, 
which correspond to an i.i.d. prior, the proposed algorithms 
resemble the fast Bayesian matching pursuit suggested in 
|[T4ll . Second, the recent work of ||251 used a BM-based 
Bayesian modeling for the sparse representation to improve 
the CoSaMP algorithm. The inherent differences between our 
greedy approach and the one suggested in [25] are explained 
in Section IXl 



VI. Exact MAP estimation 

A. Model Assumptions 

In this section we consider a simplified setup where exact 
MAP estimation is feasible. A recent work IfTSI treated the 
special case of a unitary dictionary for independent-based 
priors, and developed closed-form expressions for the MAP 
and MMSE estimators. We follow a similar route here and 
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assume that the dictionary is unitary. Q In this case we can 
make a very useful observation which is stated in Theorem [1] 
A proof of this theorem is provided in Appendix [A] 

Theorem 1: Let A be a unitary dictionary. Then the BM 
distribution is a conjugate prior for the MAP estimation 
problem of (fTOb . namely the a posteriori distribution Pi(S\y) 
is a BM with the same interaction matrix W and a modified 
bias vector q with entries: 



(y 1 



in 



(16) 

for all i, where ai is the ith column of A. 

Notice in ( fT6l ) that is linearly dependent on the original 
bias hi and quadratically dependent on the inner product 
between the signal y and the atom a^. This aligns with the 
simple intuition that an atom is more likely to be used for 
representing a signal if it has an a priori tendency to be 
turned "on" and if it bears high similarity to the signal (this 
is expressed by a large inner product). From Theorem Q] the 
MAP estimation problem of (fTOb takes on the form of integer 
programming. More specifically, this is a Boolean quadratic 
program (QP): 

maximize (q T S + ^S T WSj s.t. Sf = 1, 1 < i < m. (17) 



This is a well-known combinatorial optimization problem 
that is closely related to multiuser detection in communication 
systems, a long-studied topic [ 30 1 . The Boolean QP remains 
computationally intensive if we do not use any approximations 
or make any additional assumptions regarding the interaction 
matrix W. The vast range of approximation methods used for 
multiuser detection, like SDP relaxation, can be adapted to our 
setup. Another approximation approach, which is commonly 
used for energy minimization in the BM, is based on a Gibbs 
sampler and simulated annealing techniques IfTTl . Our interest 
here is in cases for which simple exact solutions exist. We 
therefore relax the dependency model, namely make additional 
modeling assumptions on W. 

We first consider the simple case of W = 0, which 
corresponds to an independency assumption on the entries of 
S. Using Theorem Q] we can follow the same analysis as in 
Section IIII-BI for W = by replacing the bias vector b by q. 
Consequently, in this case we have: 



Pr(%) = I] P^ly), 



(18) 



where Pr(5i = l\y) = 1 /(i+oxp(-2g i )) for all i. Notice that 
Pr(S t = l\y) > Pr(5i = -l\y) if q l > 0. This means that 
the ith entry of S MAP equals 1, namely i is in the support, 

1 In this context we would like to mention that assuming a unitary dictionary 
A is equivalent to the case z = x + to, where there is no dictionary, namely 
A is the identity matrix, and we have noisy observations of a signal with 
a BM prior. To see that, multiply each of the sides in the signal equation 
y = Ax + e by A T . In the resulting equation A T y = x + A T e, the noise 
w = A T e has the same distribution as the original noise e and A T y is 
the transformed signal. We would like to thank Prof. Phil Schniter for this 
constructive observation. 



if qi > 0. Using (fTBT l we obtain the following MAP estimator 
for S: 



1. 



\y T a,\ > 



'111 



1, 



nA 1 



(19) 



otherwise 



where pi is defined in © and = y ^ a l,i/(al i +a^). These 
results correspond to those of |fT31 for the MAP estimator 
under a unitary dictionary. 

To add dependencies into our model, we may consider two 
approaches, each relying on a different assumption on W . 
First, we can assume that all entries in W are non-negative. If 
this assumption holds, then the energy function defined by the 
Boltzmann parameters W, q is regarded "sub-modular" and it 
can be minimized via graph cuts ll3~Tl . The basic technique 
is to construct a specialized graph for the energy function 
to be minimized such that the minimum cut on the graph 
also minimizes the energy. The minimum cut, in turn, can 
be computed by max flow algorithms with complexity which 
is polynomial in m. The recent work E51 is based on this 
approach and we will relate to it in more detail in Section |X] 

Here we take a different approach, which seems to be more 
appropriate for our setup. This method makes an assumption 
on the structural component of the MRF - we assume that 
the BM is decomposable with a small tree-width. This type 
of MRF was explored in detail in Section IIII-CI The above 
assumption implies that the matrix W has a special sparse 
structure - it corresponds to a chordal graph where the size 
of the largest maximal clique is small. As we have seen in 
Section ITlI-CI decomposable models can serve as a very useful 
relaxation for general dependency models. Another motivation 
for this assumption arises from the results that were shown 
in Section [TT] for the special case of image patches and a 
DCT dictionary. It was shown there that independency can be 
considered a reasonable assumption for many pairs of DCT 
atoms. This observation has the interpretation of a sparse 
structure for the interaction matrix W. Consequently, it seems 
plausible that a matrix W with a sparse structure can capture 
most of the significant interactions in this case. 

From Theorem Q] it follows that if the above assumption 
on the structure of W holds for the BM prior on S it also 
holds for the BM posterior (since both distributions correspond 
to the same interaction matrix). We can therefore use belief 
propagation techniques to find the MAP solution. We next 
present in detail a concrete message passing algorithm for 
obtaining an exact solution to ( flTt under a banded W matrix. 

To conclude this subsection note that the use of belief 
propagation techniques [26] has recently become very popular 
in the sparse recovery field 0321 . 11331 . 1341 . However, these 
works provide a very limited treatment to the structure of the 
sparsity pattern. We will relate in more detail to these recent 
works and emphasize the contribution of our work with respect 
to them in Section [X] 

B. The Message Passing Algorithm 

Before we go into the details of the proposed message 
passing algorithm, we make a simple observation that will 
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simplify the formulation of this algorithm. As we have seen 
in Section IIII-BI a posterior BM distribution with parameters 
W, q can be written (up to a normalization factor which has 
no significance in the MAP estimation problem) as a product 
of potential functions defined on the maximal cliques in the 
corresponding graph: 



exp(q T S + -S T WS) =Y[V Ct (S, 



(20) 



where P is the number of maximal cliques. By replacing 
the potentials {^>d (Sd)} with their logarithms, which are 
denoted by |^c 4 (ScJ j> we remain with quadratic functions 
of the variables of {Si} i=1 : 



S T WS + q T S = J2 (S Ct ) 



(21) 



This can be very useful from a computational point of view 
as there is no need to compute exponents, which can lead to 
large values. Each product that appears in a standard message 
passing algorithm is replaced by summation. 

For concreteness we will focus on the special case of an 
Lth order banded in teracti on matrix W of size m-by-m, as 
described in Section IIII-CI In this case the maximal cliques 
are d — {Si, . . . , Si+z,} , i = l,...,m—L, so that all 
cliques are of size L + l and the tree-width is L. The clique 
tree takes the form of a simple chain of length m — L. We 
denote the "innermost" clique in this chain by Cfe, where 
k = \( m -- L - 1 )/2~\. We choose an order for the cliques where 
the cliques at both edges of the chain appear first and the 
"innermost" clique appears last and set the clique pot entials 
according to the rule of thumb mentioned in Section IIII-BI 
Consequently, the logarithms of the potentials are given by: 



1 < i < k - 1 
i = k 



l=i+l 

k+L k+L-1 k+L 

E US* + E E WjiSjSi 
j=k j=k t=j+i 

i + L-1 

q i+L Si + L + E Wi }i+L SiS i+ L , k+l<i<m-L 

(22) 

i&Ci is a function of Si, ... , Si+L ■ We pass messages "in- 
wards" starting from C% and C TO _£ until the clique Cfe 
receives messages from both sides: 



max i&c- 



1 



mji+i = < ~" ~ (23) 
max i&Ci + rrii-ij , 2 < i < k — 1 



m,iA-i 



max i i = m — L 

Si + L ^ 

max + n^i+ij j m — L — l<i<k + l 

. S i + L 



The arguments that correspond to each of the maximization 
operators are denoted by i = l,...,k — 1 and 

i = k + 1, . . . ,m — L (these have the same form 
as the messages with "max" replaced by "argmax"). Note that 
TO M+ i, ^m+i depend on Sj+i, . . . , S i+L and m*,i_i, 
on Si, . . . , Si+L-i- The MAP estimates are then computed 



recursively by: 

(Ski • • • ' s k+i) = argmax $ Cfc + m k -i,k + m k +i,k 

S* = $ Vi+1 (S* +1 , S* +L ) , * = fc-l 1 (24) 

S* +L = (St,.. ■,S* +L _ 1 ) , i = k + 1, . . . ,m — L. 

The message passing algorithm in this case is summarized in 
Algorithm 0] 

Algorithm 4 Message passing algorithm for obtaining the 
exact MAP estimator of (TlOb in the special case of a unitary 
dictionary and a banded interaction matrix 
Input: Noisy observations y and model parameters 

W,b,{a Xi i}™ =1 ,A,<j e . A is unitary and W is an Lth 

order banded matrix. 
Output: A recovery S'map for the sparsity pattern of x. 
Step 1 : Set the bias vector q for the BM posterior distribution 
Pr(%) using (TJ5]>. 

Step 2: Assign a potential function ^>c'i (Set) for each clique 
Ci = {Si, S i+L } , i = 1, . . . , m - L using (|22]i. 
Step 3: Pass messages "inwards" starting from C\ and C to _l 
until the "innermost" clique Cfe receives messages from both 
sides using (1231 . 

Step 4: Obtain the MAP estimate for S using d24l . 

An important observation is that the complexity of the 
proposed algorithm is exponential in L and not in m. More 
specifically the complexity is 0(2 L ■ m). As the value of 
L is part of our modeling, even when m is relatively large 
(and the exhaustive search which depends on 2 m is clearly 
infeasible), the exact MAP computation is still feasible as 
long as L remains sufficiently small. If we have for example 
L = 7log 2 (m) then the complexity is 0(m 1+7 ), namely it is 
polynomial in m. 

VII. Simulations on Synthetic Signals 

In this section we assume that all the parameters of the 
BM-based generative model are known and use this model 
to create random data sets of signals, along with their sparse 
representations. A standard Gibbs sampler is used for sampling 
sparsity patterns from the BM. The sampled supports and 
representation vectors are denoted by {s^ l \x^ "}._■,• Using 
these synthetic data sets, we test the recovery algorithms that 
were proposed in the two previous sections (see Algorithms 
[TJIUi and compare their performance to that of two previous 
sparse recovery methods. 

The first method is OMP, a standard pursuit algorithm, 
which serves as the default option that one would use for 
sparse approximation when no information is given about the 
structure. The OMP algorithm is used only for identifying 
the support. Then the recovered support is used to obtain an 
estimate for the representation vector using ( [TTl i. just as the 
MAP estimators. The second is an approximate MAP estimator 
that is based on Gibbs sampling and simulated annealing as 
suggested in ll24l . Since we do not have access to the code 
and parameters of the algorithm from 11241 . our implementation 
is not exactly the same. Rather, we chose to set the number 
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of rounds for the Gibbs sampler so that its computational 
complexity is roughly the same as the OMP-like MAP method 
(see Algorithm [TJ. This choice was made after exploring the 
influence of the number of rounds for the Gibbs sampler on its 
performance. We observed that if we increase the number of 
rounds by a factor of 10 with respect to the above suggestion, 
then performance improves only slightly. This behavior is 
associated with the slow convergence of the Gibbs sampler. 
As for annealing, we used a geometric schedule: T^+i = (3Tk, 
where 7\. is the "temperature" used in the fcth round. The initial 
"temperature" is set to be high (around 600) and (3 satisfies a 
final "temperature" of 1. 

We begin by examining a setup that satisfies the simplifying 
assumptions of Section |VT] We assume that A is an m-by- 
m unitary DCT dictionary with to = 64 and that W is a 
9th order banded interaction matrix. The values of the model 
parameters are in the following ranges: [—1,1] for the nonzero 
entries in W, [—3, —2] for the biases 1 and [15, 60] 

for the variances {cx,t }.-_!• In this case we can apply all of 
the algorithms that were suggested in this paper. However, 
for concreteness we chose to apply here only Algorithms 1 1 1 3 1 
and [4j leaving Algorithm [2] for the second set of synthetic 
experiments. In Algorithm [3] we performed J = 10 runs of 
the random greedy pursuit. 

We compare the performance of the five algorithms for 
different noise levels - er e is in the range [2,30]. For each of 
the above-mentioned algorithms we evaluate two performance 
criteria. The first one is the average normalized error in 
identifying the true support: 



1 



1 



N 



s C0naW| 



JV^max(|s|,|s|)" 



(25) 



Note that for the random greedy algorithm we evaluate the 
support error using the indices of the k largest coefficients 
(in absolute value) in the obtained solution x as the recov- 
ered support s. The second criterion is the relative recovery 
error, namely the mean recovery error for the representation 
coefficients normalized by their energy: 
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The relative error is also evaluated for the Bayesian oracle 
estimator, namely the oracle which knows the true support. 
Note that for a unitary dictionary the relative error for the 
representation coefficients is in fact also the relative error for 
the noise-free signal, since = f° r an Y vector u. 

The results appear in Fig. |4] 

Several observations can be made from the results in Fig. 
|4] First, all BM-based pursuit methods outperform the OMP 
algorithm. Notice that the message passing algorithm (exact 
MAP) performs well and the performance of the OMP-like 
algorithm is not too far off. Second, the OMP-like MAP 
outperforms Gibbs sampling, for the same computational 
complexity. Finally, the randomized version of the OMP- 
like method obtains a recovery error which is roughly the 



same as exact MAP (recall that the random greedy algorithm 
approximates the MMSE estimator). 

We now provide some additional observations that were 
drawn from similar sets of experiments which are not shown 
here. We observed that the performance gaps between the exact 
MAP and its approximations are associated more with the 
"strength" of the interactions than with the average cardinality. 
When we tested higher (less negative) biases and weaker in- 
teractions, so that the average cardinality remains roughly the 
same, the approximations align with the exact MAP (except 
for the Gibbs sampler which still performs slightly poorer). As 
for higher noise levels, we noticed that all algorithms exhibit 
saturation in their performance. In this setup the OMP tends to 
choose an empty support. The convergence criterion for OMP 
is \\y — A s a; s ||2 < r\\pn<j e , where 77 is a constant which is 
close to 1. This is a standard criterion used for denoising with 
OMP. When <r e is large, it happens often that the OMP stops 
before using any atom. 

Next, we turn to the case of a redundant dictionary and a 
general (non-sparse) interaction matrix. We use the 64-by-256 
overcomplete DCT dictionary. All the rest of model parameters 
are the same as before, expect for the interaction matrix which 
is no longer banded and its values are in the range [—0.1, 0.1]. 
For this setup exact MAP estimation is no longer possible 
and we can use only the greedy approximations for MAP 
and MMSE (see Algorithms |T][3j. We evaluate the average 
normalized error in the support (|25T > and the relative recovery 
error with respect to the noise-free signal: 
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The results are shown in Fig. [5] We see that both the OMP-like 
MAP and the Gibbs sampler outperform the OMP algorithm. 
However, there is a small performance gap in favor of the 
OMP-like MAP. In terms of the recovery error, we can see that 
this performance gap increases with the noise level. Notice that 
the randomized version of the OMP-like method achieves only 
a slightly better recovery error with respect to the original one. 
Finally, the thresholding-like method is the worst for noise 
levels below a e = 10 (even OMP performs better). However, 
as the noise level increases its performance becomes close to 
that of the OMP-like MAP. Consequently, this method seems 
adequate for high noise levels. 

To conclude this section we discuss the use of the three 
suggested greedy algorithms, in terms of their computational 
complexity and recovery quality. The thresholding-like method 
requires the least computational effort - 0(m), but leads to a 
much inferior recovery, compared to the two other methods. 
For the OMP-like algorithm the computational complexity is 
increased by a factor of k - the cardinality of the obtained 
support - but the recovery improves significantly. The recovery 
quality can be further improved using the random OMP-like 
method, however this seems to be a minor improvement given 
the resulting increase in computations by a factor of Jo (10 
in our case). In short, the OMP-like method provides the best 
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Figure 4. Normalized error in identifying the support (25) and relative recovery error i26\ for the 64-by-64 unitary DCT dictionary and a 9th order banded 
interaction matrix. Results are shown for a data set with average cardinality \s\ = 9.8. 




Figure 5. Normalized error in identifying the support )25t and relative recovery error with respect to the noise-free signal d27t for the 64-by-256 overcomplete 
DCT dictionary and a general interaction matrix. Results are shown for a data set with average cardinality \s\ = 10.3. 



compromise to the observed tradeoff between computational 
complexity and recovery quality. 

VIII. Adaptive Sparse Signal Recovery 
In an actual problem suite we are given a set of signals 
{y };— i fr° m which we would like to estimate both the 
sparse representations and the model parameters. We address 
the joint estimation problem in this section by a block- 
coordinate relaxation approach. This approach can be applied 
for both the arbitrary and unitary dictionaries. The only differ- 
ence is in the pursuit algorithm we use. Note that throughout 
this section we will assume that the noise variance a\ is 
known. This is a a typical assumption in denoising setups 
with Gaussian noise. We also assume that the dictionary is 
fixed and known. Dictionary learning is a common practice in 
the sparse recovery field (see for example [35]). However, for 
concreteness we will not address here how to merge dictionary 
learning into the adaptive scheme and we leave this for future 
work. 

A. Model Estimation 

We begin with the model estimation problem. This 
means that we have a data set of i.i.d. examples T> = 



{y^ l \ , }^_ v from which we would like to learn the 



model parameters 



To estimate 9 we 



suggest a maximum likelihood (ML) approach, and using the 
BM generative model we can write: 

m 

Q ML = argmax Pr (X>|6) = argmax ^ C{a 2 x i ) + C{W, b), 



where 
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(28) 
°] (29) 

Nln(Z(W,b)) 
(30) 

are the log likelihood functions for the model parameters. 
This decomposition allows separate estimation of the variances 
Wx }"= i an d the Boltzmann parameters W, b. 

Starting with the variances we have the close-form estima- 
tor: 
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Similar estimators for the variances were also used in Il24l . 

ML estimation of W, b is computationally intensive due to 
the exponential complexity in m associated with the partition 
function Z(W, b). Therefore, we turn to approximated ML es- 
timators. A widely used approach is applying Gibbs sampling 
and mean-field techniques in each iteration of a gradient-based 
optimization algorithm. These methods were used in [24|, 
which is the only work that considered estimating the BM 
parameters for sparsity models. However, we suggest using 
a different approach which seems to be much more efficient 
- MPL estimation. This approach was presented in [36] and 
revisited in 11371 . where it was shown that the MPL estimator 
is consistent. This means that in the limit of infinite sampling 
{N — > oo), the PL function is maximized by the true parameter 
values. 

The basic idea in MPL estimation is to replace the BM 
prior Pr(S\W,b) by the product of all the conditional dis- 
tributions of each node Si given the rest of the nodes S^c: 
nl=i P r (Si\SiC , W, b). Each of these conditional distributions 
takes on the simple form 

Pt(S 1 \S 1 c, W,b) = Cexp{Si (WfS + bi)} (32) 

where is the ith row of W and C is a normalization 
constant. Since this is a probability distribution for a single 
binary node 5, it follows that C = (2 cosh (W? S + h))~ . 
Consequently, we replace Pr(S'|W / , b) by 

e X p{S i {WTS + b i )} 



n*(*i*.*>>=n 

L—l 1=1 V 1 

exp{S* T (WS + b)} 



2™ U?=i cosh {WfS + h) 
We define the log-PL by: 



C P (W, 6) = EE ln ( Pr { S i l) \ S lc ,W,b)) (34) 
i=i j=i 

N _ 

= E { S(l) ) ( WSW + b )- lT P ( WS(l) +b)~rnN ln(2) 

where p(z) = ln(cosh(z)) and the function p(-) operates on 
a vector entry-wise. To explore the properties of the log-PL 
function it is useful to place all the Boltzmann parameters 
- there are p — (m 2 +my 2 unknowns (0 2 -" l )/2 in the upper 
triangle of W and m in b) - in a column vector u. For each ex- 
ample in the data set we can construct matrices 
so that B^u = (S"W) T (WS® + b) and C^u = WS^+b. 

Using these notations the log-PL function of ( |34l can be 
re-formulated as: 
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The gradient and the hessian of C p {u) are given by: 
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where p'(z) = tanh(z) and p"{z) = 1 — tanh 2 (z). Since p{z) 
is a convex function, it follows that the log-PL function is con- 
cave in u. Therefore, as an unconstrained convex optimization 
problem, we have many reliable algorithms that could be of 
use. 

In (37 1 MPL estimation is treated by means of gradient 
ascent (GA) methods. These methods are very simple, but 
it is well-known that they suffer from a slow convergence 
rate [38 1. Another optimization algorithm which converges 
more quickly is Newton [38 1. Note however that the problem 
dimensions here can be very large. For example, when m = 64 
as in an 8-by-8 image patch and a unitary dictionary, we 
have p = 2080 unknown parameters. Since Newton iterations 
requires inverting the Hessian matrix, it becomes computa- 
tionally demanding. Instead we would like to use an efficient 
algorithm that can treat large-scale problems. To this end we 
suggest the sequential subspace optimization (SESOP) method 
[39 1, which is known to lead to a significant speedup with 
respect to gradient ascent. 

The basic idea in SESOP is to use the following update rule 
for the parameter vector in each iteration: 



(38) 



where is a matrix consisting of various (normalized) 
direction vectors in its columns and a! 1 is a vector containing 
the step size in each direction. In our setting we use only 
the current gradient g- 7 = VCpfa) and M recent steps 
p 1 = u z ~ u 1 - 1 , i = j — M,...,j — 1, so that H j 
is a p-by-M + 1 matrix for sufficiently large j. We use 
the abbreviation SESOP-M for this mode of the algorithm. 

(33) 

The vector a 3 is determined in each iteration by an inner 
optimization stage. Since we use a small number of directions, 
maximizing C p {u 3+1 ) with respect to a J is a small-scale 
optimization problem and we can apply Newton iterations 



to solve it, using V a jC p {u 



v 2 3 



{Hi) VC p (uJ +1 ) and 
£ p (ui +1 ) = (W) r V 2 £ p (uJ +1 )W. 
To initialize the algorithm we set the interaction matrix to 
zero, namely we allow no interactions. We then perform a 
separate MPL estimation of b where W is fixed to zero, which 
results in 

r 1 



atanh 



JV 
i=i 



s 



(39) 



for all i. We stop the algorithm either when the norm of 
the gradient vector VC p (u) decreases below a pre-determined 
threshold e, or after a fixed number of iterations J\. A pseudo- 
code that summarizes the learning algorithm for the Boltzmann 
parameters is provided in Algorithm [5] 

To demonstrate the effectiveness of MPL estimation via 
SESOP, we now show some results of synthetic simulations. 
We use a Gibbs sampler to generate = 16, 000 support 
vectors from a BM prior with the following parameters: W is a 
9th order banded matrix of size 64-by-64 with nonzero entries 
drawn independently from U [—0.5,0.5] and b is a vector of 
size 64 with entries drawn independently from Af (— 1.5,1). 
We then use these supports as an input for the learning 
algorithm and apply 50 iterations of both GA and SESOP-2 
to estimate the Boltzmann parameters. The results are shown 
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Figure 6. Top - results of MPL estimation via GA and SESOP: The value of the log-PL objective and the average recovery error for the interaction matrix 
per entry as functions of the number of iterations. Middle (from left to right): The true interaction matrix W and MPL estimate via GA Wga - Bottom (from 
left to right): MPL estimate via SESOP Wsesop, a banded version of it and a matrix consisting of the interactions in W which are more likely to be 
revealed using the given data set. We can see that the latter two are very close. 



Algorithm 5 A SESOP-M algorithm for obtaining the MPL 
estimator of the Boltzmann parameters 

Input: A data set of supports {S^}^. 

Output: A recovery W, b for the Boltzmann parameters. 

Initialization: Set W a to zero and b° according to d39l l, and 

construct from them a column vector u°. 

.1=0 
repeat 

Step 1: Evaluate C p (u 3 ) and V£ p (u 3 ) using (f35T>-(f36b. 
Step 2: Set the matrix H 3 using the current gradient 
\7C p (u J ) and M previous steps |u l — j_ M - 
Step 3: Determine the step sizes a 3 by Newton iterations. 

Step 4: u 3+1 = v? +H 3 a 3 . 

3=3 + 1 
until S7£ p (u 3 ) < e or j > J x 
Return: W, b extracted out of v? . 



in Fig. [6] We can see on the top that SESOP outperforms 
GA both in terms of convergence rate of the PL objective 
and recovery error for the interaction matrix. This is also 
demonstrated visually on the middle and bottom, where we 
can see that for the same number of iterations SESOP reveals 
much more interactions than GA. In fact, if we set to zero the 
entries in the true W that correspond to rarely used atoms (i.e. 
if the appearance frequency of atoms i or j is very low then 
we set Wij = 0), we can see that SESOP was able to learn 



most of the significant interactions 0. 

B. Joint Model Estimation and Pursuit 

We now turn to the joint estimation problem, where both the 
sparse representations and the model parameters are unknown. 
We suggest using a block-coordinate optimization approach 
for approximating the solution of the joint estimation problem, 
which results in an iterative scheme for adaptive sparse signal 
recovery. Each iteration in this scheme consists of two stages. 
The first is sparse coding where we apply one of the pursuit 
algorithms that were proposed throughout this paper to obtain 
estimates for the sparse representations based on the most 
recent estimate for the model parameters. If the dictionary 
is unitary and the interaction matrix is banded we apply the 
message passing scheme of Algorithmic Otherwise we use a 
greedy pursuit (see Algorithms QJ3]). This is followed by model 
update where we re-estimate the model parameters given the 
current estimate of the sparse representations. We use (fJTJ for 
the variances and MPL estimation via SESOP (see Algorithm 
[5]l for the Boltzmann parameters. 

For a setup where the interaction matrix is assumed to be 
banded, we suggest performing a post-processing of the MPL 
estimate. More specifically, we define the energy of W as the 
li norm for the entries in the banding zone. The basic idea is to 
perform pairwise permutations in W, namely switch the roles 

2 An atom is labeled as "rarely used" if it is active in less than 0.3% of the 
data samples. This is an arbitrary definition, but it helps in showing that the 
estimated parameters tend to be close to correct. 
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of pairs of atoms, so that the energy will be maximal. A greedy 
approach can be used, so that in each iteration we replace the 
roles of one pair of atoms, where this replacement is optimal in 
the sense of maximizing the energy. The algorithm converges 
when we cannot increase the energy anymore. At this point 
we set all entries located outside the banding zone to zero. 
The suggested post-processing stage serves as a projection 
onto the banding constraint. Note that the estimated biases and 
variances should also be modified to account for the changes 
in the atom roles. 

IX. Simulations on Image Patches 

The paper starts with a motivating example on image 
patches of size 8-by-8 that are extracted out of natural images 
(see Section Q]}, showing that there are overlooked dependen- 
cies. We now return to this very set of patches and show 
that the proposed approach does better service to this data. 
We add white Gaussian noise to these patches and apply the 
adaptive BM-based sparse recovery scheme that was suggested 
in the previous section on the noisy patches. We consider two 
methods that follow this approach. In the first method we fix 
the dictionary to be the 64-by-64 unitary DCT and assume that 
the interaction matrix is 9th order banded. Therefore we use 
message passing (Algorithm [U for the sparse coding stage 
and apply post-processing on the learned model parameters 
to satisfy the banding constraint. The second method uses 
a fixed overcomplete DCT dictionary of size 64-by-256 and 
assumes nothing on the interaction matrix. Here we use OMP- 
like pursuit (Algorithm [TJ for sparse coding. 

To initialize the parameters of the adaptive BM-based meth- 
ods, we set all the variances to 50 2 and use an i.i.d. prior on 
the support, namely Pr(5i = 1) = p for all i. This prior 
is obtained by the Boltzmann parameters W — Q mxm and 
bi = 0.51n(p/(i-p)) for all i. Note that p has the intuitive 
meaning of the ratio k /m where k is our prior belief on the 
mean cardinality of the support. We use a prior belief that 
the average cardinality for image patches is k = 10. We then 
perform two iterations for each of the adaptive schemes. 

Note that we are not suggesting here an improved image 
denoising algorithm, and in contrast to common denoising 
methods, we do not exploit self-similarities in the image (see 
for example l40l ). Therefore our comparison is limited to de- 
noising schemes that recover each patch separately. We focus 
on denoising methods based on the OMP algorithm, since 
this is the standard pursuit algorithm in patch-based image 
denoising schemes, see for example [41 J. For concreteness 
we also avoid here comparing our approach with methods that 
are based on dictionary learning (see for example [35|). For 
a comparison with K-SVD denoising [41] which is based on 
sparse coding via OMP and dictionary learning, see our recent 
paper |42l . 

We will not show here a comparison to other sturctured 
sparsity methods, and for a reason. As we mention in Section 
HI the two most common approaches for structured sparsity are 
block-sparsity and wavelet trees. These two structures have 
been successfully incorporated into standard sparse recovery 
algorithms like CoSaMP, see for example Q. However, in the 



image patches experiments of Section HU the DCT coefficients 
show neither a tree nor a block-sparsity structure. Therefore 
in this case it is not even clear how to determine the block 
structure or tree structure for the pursuit. The alternative is 
to change the dictionary to a wavelet. Note however that the 
common practice in wavelet denoising is to apply the wavelet 
transform on large sub-images, rather than small patches. For 
this reason we feel that it would not have been fair to compare 
our results on small patches with the ones obtained by a pursuit 
method based on wavelet trees. 

We compare our approach to two simple denoising schemes 
which apply the OMP algorithm on the noisy patches using the 
64-by-64 unitary DCT and the 64-by-256 overcomplete DCT 
dictionaries. Throughout this section we use the abbreviations 
"unitary OMP", "unitary BM recovery", "overcomplete OMP" 
and "overcomplete BM recovery" to denote the four methods. 
Average denoising errors per pixel are evaluated for the four 
methods and for 6 noise levels: a e £ {2, 5, 10, 15, 20, 25}. A 
summary of the denoising results is given in Table [I] where 
the best result for each noise level is highlighted. 

These results show that the adaptive BM-based approach 
suggested throughout this paper obtains better denoising per- 
formance on noisy image patches than a standard sparse recov- 
ery algorithm such as OMP. For the unitary DCT dictionary, 
the performance gaps of BM recovery with respect to OMP 
vary from 0.84[<iB] to 1.23[cLB] for the different noise levels. 
When we turn to the overcomplete DCT dictionary, the perfor- 
mance gaps vary from 0.21[dB] to 0.96[dB]. Note that for both 
dictionaries OMP obtains a similar performance, with a slight 
performance gap in favor of the unitary dictionary. As for the 
BM recovery, the message passing algorithm (used for the 
unitary case) outperforms the OMP-like algorithm (used for 
the overcomplete case) for all noise levels, except for er e = 25, 
where the two algorithms exhibit similar performance. This is 
associated, at least in part, with the accuracy of the pursuit 
algorithm: exact MAP for the unitary case versus approximate 
MAP for the overcomplete case. To take full advantage of 
the redundancy in the dictionary, one should use dictionary 
learning. We leave this for future work, where we intend to 
merge dictionary learning into the adaptive scheme, in order 
to benefit from both the BM generative model and a dictionary 
which is better fitted to the data. 

X. Relation to past works 

In this section we briefly review several related works and 
emphasize the contributions of our paper with respect to them. 
We begin with recent works (23), 112411 . [25 1 that used the 
BM as a prior on the support of the representation vector. In 
recent years capturing and exploiting dependencies between 
dictionary atoms has become a hot topic in the model-based 
sparse recovery field. In contrast to previous works like Q, 
|[7l . EH . 1221 which considered dependencies in the form of 
tree structures, 11231 . Il24l . Il25l propose a more general model 
for capturing these dependencies. 

The authors of [231] use a BM prior on the sparsity pattern of 
Gabor coefficients to capture persistency in the time-frequency 
domain. They adopt a non-parametric Bayesian approach 
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Unitary OMP 


Unitary BM recovery 


Overcomplete OMP 


Overcomplete BM recovery 


2 


2.58 


2.24 


2.52 


2.46 


5 


4.79 


4.29 


4.9 


4.69 


10 


7.55 


6.85 


7.72 


7.19 


15 


9.76 


8.81 


9.94 


9.12 


20 


11.64 


10.53 


11.82 


10.71 


25 


13.33 


12.1 


13.49 


12.08 



Table I 

Summary of average denoising results (Root-MSE per pixel). 



and address the estimation problems by MCMC inference 
methods. In their work the Boltzmann parameters are assumed 
to be known and fixed. This is contrast to our work where 
we develop efficient methods for estimating both the sparse 
representations and the Boltzmann parameters. 

The work of 11241 makes use of a BM prior in the more 
general context of a sparse coding model, which is represented 
by a graphical model. They provide a biological motivation for 
this modeling through the architecture of the visual cortex. 
We used exactly the same graphical model in our work (see 
Section HVT>. In [24| MAP estimation of the sparse representa- 
tion is addressed by Gibbs sampling and simulated annealing. 
These techniques often suffer from a slow convergence rate, 
so that the algorithm is stopped before the global maximum 
is reached. In the current work we suggest alternative pursuit 
methods for MAP estimation. As we have seen in the synthetic 
simulations of Section I VIII our suggested pursuit methods 
outperform the one suggested in (24). 

For learning the Boltzmann parameters the authors of ll24l 
suggest Gibbs sampling and mean-field approximations for 
estimating the gradient of the likelihood function in every 
iteration of a GA algorithm. In our MPL-based algorithm we 
suggest a simple update in each iteration, which is based on 
standard convex optimization methods, instead of the more 
computationally demanding Gibbs sampling process required 
in each iteration of the approximate ML algorithm. Our 
evaluations suggest that there is at least a factor of 10 in the 
complexity per iteration, between the Gibbs sampler and a 
plain GA based on our MPL. Since we have added acceleration 
(SESOP), the gap between the two methods is in fact even 
higher, as we will probably need far less iterations. 

Next, we turn to ll25l . This work adapts a signal model like 
the one presented in 11241 . with several modifications. First, 
it is assumed that all the weights in the interaction matrix 
W are nonnegative. Second, the Gaussian distributions for the 
nonzero representation coefficients are replaced by parametric 
utility functions. The main contribution of ll25ll is using the BM 
generative model for extending the CoSaMP algorithm, a well 
known greedy method. The extended algorithm, referred to as 
lattice matching pursuit (LaMP), differs from CoSaMP in the 
stage of the support update in each iteration, which becomes 
more accurate. This stage is now based on graph cuts and 
this calls for the nonnegativity constraint on the entries of W. 
The rest of the iterative scheme however remains unchanged 
and is still based on "residuals": in each iteration we compute 
the residual with respect to the signal and the algorithm stops 
when the residual error falls below a pre-determined threshold. 

There are several fundamental differences between our work 



and the one reported in l25l . First, while LaMP exploits 
the BM-based generative model only in its support update 
stage, this model is incorporated into all of the stages of our 
greedy algorithms, including the stopping rule. Our greedy 
algorithms work for an arbitrary interaction matrix and in 
this sense they are more general than LaMP. Furthermore, 
LaMP requires the desired sparsity level as an input to the 
algorithm. In contrast, our approach assumes nothing about the 
cardinality, and instead maximizes the posterior with respect 
to this unknown. LaMP also makes use of some auxiliary 
functions that need to be finely tuned in order to obtain good 
performance. These are hard to obtain for the generative model 
we are considering. Because of all these reasons, it is hard 
to suggest a fair experimental comparison between the two 
works. 

We now turn to recent works ll32l . ll33ll . |[34l which con- 
sidered graphical models and belief propagation for sparse 
recovery. All of these works represent the sparse recovery 
setup as a factor graph [26| and perform sparse decoding via 
belief propagation. Note however that the first two works use 
the typical independency assumption on the representation co- 
efficients. More specifically, [32] assumes that the coefficients 
are i.i.d. with a mixture of Gaussians for their distribution. 
Hence, the main contribution of these works is exploiting 
the structure of the observations using graphical models. This 
is in contrast to our work where we focus on structure in 
the sparsity pattern in order to exploit dependencies in the 
representation vector. 

The third work [34] suggests exploiting both the structure 
of the observations and the structure of the sparsity pattern, 
using factor graphs and belief propagation techniques. This 
work is actually more general than ours. However, it leaves 
the specific problem that we have handled almost untouched. 
Various structures for the sparsity pattern are mentioned there, 
including an MRF model. However, the main focus in this 
paper is how to efficiently combine the observation-structure 
and pattern-structure. The treatment given for the sparsity- 
pattern decoding is very limited and empirical results are 
shown only for a Markov chain structure. This is in contrast 
to our work where we mainly focus on pattern-structure and 
address the more general setup of an MRF model. 

Finally, our work differs from typical works on model- 
based sparse recovery, in terms of the signal dimensions. Our 
work is limited to signals of low-dimensions. This is why 
we have tested denoising on image patches, where each is of 
low-dimension. This limitation arises from the fact that, as 
a general framework, our BM model requires an interaction 
matrix W which is of size m-bym. When m is too large 
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(beyond ~ 1000), estimating W and working with it become a 
problem. In contrast, tree-based sparse recovery methods, like 
the ones suggested in Q, need to work on high-dimensional 
signals in order to truly benefit from the multi-scale structure 
of the wavelet coefficients. 



XI. Conclusions 

In this work we developed a scheme for adaptive model- 
based recovery of sparse representations, which takes into 
account statistical dependencies in the sparsity pattern. To 
exploit such dependencies we adapted a Bayesian model for 
signal synthesis, which is based on a Boltzmann machine, 
and designed specialized optimization methods for the esti- 
mation problems that arise from this model. This includes 
MAP and MMSE estimation of the sparse representation and 
learning of the model parameters. The main contributions 
of this work include the development of pursuit algorithms 
for signal recovery: greedy methods which approximate the 
MAP and MMSE estimators in the general setup and an 
efficient message passing algorithm which obtains the ex- 
act MAP estimate under additional modeling assumptions. 
We also addressed learning issues and designed an efficient 
estimator for the parameters of the graphical model. The 
algorithmic design is followed by convincing empirical ev- 
idence. We provided a comprehensive comparison between 
the suggested pursuit methods, along with standard sparse 
recovery algorithms and Gibbs sampling methods. Finally, 
we demonstrated the effectiveness of our approach through 
real-life experiments on denoising of image patches. We have 
released a Matlab toolbox containing all of the suggested 
BM-based algorithms for both pursuit and model estimation, 
available at http://www.cs.technion.ac.il/~elad/software/ or 
http://webee.technion.ac.il/people/YoninaEldar/software.html 



Appendix A 
Proof of Theorem[T] 

We show how the assumption that the dictionary is unitary 
can be used to simplify the expression for Pr(5|y). For 
a unitary dictionary we have Aj A s — I for any support 
s. Consequently, for a support of cardinality k the matrix 
Q s — A s + (JgSj 1 is a diagonal matrix of size fc-by- 
k with entries di = 1 + "s/nj,, i — Si,...,Sfc on its 
main diagonal. Straightforward computations show that the 
following relations hold: 



y T A s Q^ATy=Y,d-\y T a i )\ 
ln((det(Q s ))) =X>W 



(40) 



Using the definition of S (Si = 1 implies that i is in the 
support and 5, = —1 implies otherwise), we can replace 
each sum over the entries in the support J^ies v i by a sum 
over all possible entries Y^iiLi I (~W + 1) v i- Consequently, the 



relations in (|40l can be re-written as 

T 

V 



T \2 
CLi) 



Ci + -f 1 s 



A s Qj x A^y=^(Si + l)dr\y 
In ((det(Q.))) =~ jr (Si + 1) In (*) = C 2 + \g T S (41) 



where C\ , C2 are constants and /. g are vectors with entries 
fi = d~ 1 (y T a,i) 2 , gi — ln(<ii) for i — l,...,m. We now 
place the relations of ( |4TT > into the appropriate terms in < TT~0b 
and get: 

l xl (Pr(S\y)) = C 3 +(b+jL-~-^j S+±S T WS 

(42) 

where C3 is a constant. It is now easy to verify that the pos- 
terior distribution Pr(S\y) corresponds to a BM distribution 
with the same interaction matrix W and a modified bias vector 
which we denote by q = b + jtt — - — -' 



1 



Pr(%) = j exp (y S + -S 1 WS 



4 4" 
1 



(43) 



where Z is a partition function of the BM parameters W, q 
which normalizes the distribution. Using the definitions of /, 
g and v we get that ( TToT i holds. 
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